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Abstract 

One of the central problems in statistics and machine learning is regression: Given values of input variables, called 
features, develop a model for an output variable, called a response or task. In many settings, there are potentially thousands 
0^ of possible features, so that feature selection is required to reduce the number of predictors used in the model. Feature 

selection can be interpreted in two broad ways. First, it can be viewed as a means of reducing prediction error on unseen 
test data by improving model generalization. This is largely the focus within the machine-learning community, where the 
primary goal is to train a highly accurate system. The second approach to feature selection, often of more interest to 
scientists, is as a form of hypothesis testing: Assuming a "true" model that generates the data from a small number of 
features, determine which features actually belong in the model. Here the metrics of interest are precision and recall, more 
^5 than test-set error. 

Many regression problems involve not one but several response variables. Often the responses are suspected to share 
a common underlying structure, in which case it may be advantageous to share information across the responses; this 
is known as multitask learning. As a special case, we can use multiple responses to better identify shared predictive 
features — a project we might call multitask feature selection. 

This thesis is organized as follows. Section 1 introduces feature selection for regression, focusing on regularization 
methods and their interpretation within a Minimum Description Length (MDL) framework. Section 2 proposes a novel 
extension of MDL feature selection to the multitask setting. The approach, called the "Multiple Inclusion Criterion" (MIC), 
t/3 is designed to borrow information across regression tasks by more easily selecting features that are associated with multiple 

responses. We show in experiments on synthetic and real biological data sets that MIC can reduce prediction error in 
settings where features are at least partially shared across responses. Section 3 surveys hypothesis testing by regression with 
a single response, focusing on the parallel between the standard Bonferroni correction and an MDL approach. Mirroring 
^ the ideas in Section 2, Section 4 proposes a novel MIC approach to hypothesis testing with multiple responses and shows 

that on synthetic data with significant sharing of features across responses, MIC outperforms standard FDR-controlling 
methods in terms of finding true positives for a given level of false positives. Section 5 concludes. 

O 

o 

\Q 1 Feature Selection with a Single Response 

On The standard linear regression model assumes that a response y is generated as a Hnear combination of m predictor variables 
("features") xi, . . ., Xm with some random noise: 



o 



P^Xi+ j32X2 + ... + PmXm + f; £ 7V(0, CT^) , (1) 



> 

X 

$H where we assume the first feature xi is an intercept whose value is always 1. Given n observations of the features and 
responses, we can write the y values in an n x 1 vector Y and the x values in an n x m matrix X. Assuming the observations 
are independent and identically distributed, (1) can be rewritten as 

y = A/3 + e, e~AA„(0,a2/„x„), (2) 
where (3 — ... , Mn denotes the n-dimensional Gaussian distribution, and /„xn is the n x n identity matrix. 

Pin 

The maximum-likelihood estimate f3 for /? under this model can be shown to be the one minimizing the residual sum of 
squares: 

RSS := (y- A/3)'(r- A/3). (3) 

The solution is given by 

(3 = {x'xy^ X'Y (4) 
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and is called the ordinary least squares (OLS) estimate for 

In some cases, we may want to restrict the number of features in our model to a subset of q of them, including the 
intercept. In this case, we pretend that our X matrix contains only the relevant q columns when applying (4); the remaining 
m — q entries of the /3 matrix are set to 0. I'll denote the resulting estimate by (3q, and the RSS for that model by 

RSS, := (y-X/3g)'(r-X/3,). (5) 

1.1 Penalized Regression 

In many cases, regression problems have large numbers of potential features. For instance, [FS04] predicted credit-card 
bankruptcy using a model with more than m — 67,000 potential features. In bioinformatics applications, it is common to 
have thousands or tens of thousands of features for, say, the type of each of a number of genetic markers or the expression 
levels of each of a number of gene transcripts. The number of observations, in contrast, is typically a few hundred at best. 

The OLS esimate (4) breaks down when m > n, since the mxm matrix X' X is not invertible due to having rank at most 
n. Moreover, it's implausible that a given response is actually linearly related to such a large number of features; a model (3 
with lots of nonzeros is probably overfitting the training data. 

The statistics and machine-learning communities have developed a number of approaches for addressing this problem. 
One of the most common is regularized regression, which aims to minimize not (3) directly, but a penalized version of the 
residual sum of squares: 

{Y-X(3y{Y-Xf3) + X\\(3\\^, (6) 

where represents the £p norm of /3 and A is a tunable hyperparameter, to be determined by cross-validation or more 

sophisticated regularization path approaches.^ 

Ridge regression takes the penalty as proportional to the (square of the) £2 norm: A \\P\\2- This corresponds to a Bayesian 
maximum a posteriori estimate for (3 under a Gaussian prior f3 ~ Mifimxi, ^^mxm), with as in (1) [Bis06, p. 153]. Under 
this formulation, we have 

P= iX'X+XImy,rr.)-'X'Y, 

which is computationally valid because X'X + Xlmxm is invertible for A > 0. However, because the square of a decimal 
number less than one is much smaller than the original number, the £2 norm offers little incentive to drive entries of (3 to 
— many of them just become very small. 

Another option is to penalize by the £1 norm, which is known as lasso regression and is equivalent to a double-exponential 
prior on /3 [Tib96]. Unlike £2, £1 regularization doesn't square the coefficients, and hence the entries of /? tend to be sparse. 
In this way, £1 regression can be seen as a form of feature selection — i.e., choosing a subset of the original features to keep in 
the model (e.g., [BL97]). Sparsity helps to avoid overfitting to the training set; as a result, the number of training examples 
required for successful learning with £1 regularization grows only logarithmically with the number of irrelevant features, 
whereas this number grows linearly for £2 regression [Ng04]. Sparse models also have the benefit of being more interpretable, 
which is important for scientists who want to know which particular variables are actually relevant for a given response. 
Building regression models for interpretation is further discussed in Sections 3 and 4. 

If £2 regression fails to achieve sparsity because coefficients are squared, then, say, £1 regression should achieve even more 
sparsity than £1. As p approaches 0, approaches the number of nonzero values in /3. Hence, regularization with what is 
called the "£o norm" is subset selection: Choosing a small number of the original features to retain in the regression model. 
Once a coefficient is in the model, there's no incentive to drive it to a small value; all that counts is the cost of adding it in 
the first place. The £0 norm has a number of advantages [LPFU08], including bounded worst-case risk with respect to the £1 
norm and better control of a measure called the "false discover rate" (FDR), explained more fully in Section 3. Moreover, as 
[OTJ09, p. 1] note, "A virtue of the [^o] approach is that it focuses on the qualitative decision as to whether a covariate is 
relevant to the problem at hand, a decision which is conceptually distinct from parameter estimation." However, they add, 
"A virtue of the [£i] approach is its computational tractability." Indeed, exact £0 regularization requires subset search, which 
has been proved NP-hard [Nat95]. In practice, therefore, an approximate greedy algorithm like forward stepwise selection is 
necessary [BL05, p. 582]. 

In a regression model, residual sum of squares is proportional up to an additive constant to the negative log-likelihood of 
p. Therefore, £q regularization can be rephrased as a penalized likelihood criterion (with a different A than in (6)): 

-21nF(y|/3,) + Ag, (7) 

^See, e.g., [FriOS] for an excellent introduction. 



2 



where q is the number of features in the model, and P{Y \ (3q) is the likelihood of the data given a model containing q features. 
As noted in [GFOO], statisticians have proposed a number of choices for A, including 

• A = 2, corresponding to Mallows' Cp []\Ial73], or approximately the Akaike Information Criterion (AIC) [Aka73], 

• A = Inn, the Bayesian Information Criterion (BIC) [Sch78], and 

• A = 21nTO, the Risk Inflation Criterion (RIC) [DJ94, FG94]. 

It turns out that each of these criteria can be derived from an information-theoretic principle called Minimum Description 
Length (MDL) under different "model coding" schemes (e.g., [FS99, HY99]). MDL forms the focus of this thesis, and its 
approach to regression is the subject of the next subsection. 



1.2 Regression by Minimum Description Length 

MDL [Ris78, Ris99] is a method of model selection that treats the best model as the one which maximally compresses a 
digital representation of the observed data. We can imagine a "Sender" who wants to transmit some data in an email to a 
"Receiver" using as few bits as possible [Sti04]. In the case of linear regression, we assume that both Sender and Receiver 
know the n x m matrix X, and Sender wants to convey the values in the n x 1 matrix Y . One way to do this would be to 

However, if the response is correlated with some features. 



send the raw values for each of the n observations Y 



Vn 



it may be more efficient first to describe a regression model j3 for Y given X and then to enumerate the residuals Y — X(3, 
which — having a narrower distribution — require fewer bits to encode. 

To minimize description length, then. Sender should choose /3 to minimize 



V{Y\P) + V{P), 



(8) 



where the first term is the description length of the residuals about the model, and the second term is the description length 
of the model itself.^ Exactly what these terms mean will be elaborated below. 

1.2.1 Coding the Data: V{Y \ P) 

The Kraft inequality in information theory [CT06, sec. 5.2] implies that for any probability distribution {pi} over a finite or 
countable set, there exists a corresponding code with codeword lengths [— Igp^]. Moreover, these code lengths are optimal 
in the sense of minimizing the expected code length with respect to {pi}- If Sender and Receiver agree on a model for the 
data, e.g., (1), then they have a probability distribution over possible configurations of residuals e, so they will agree to use 
a code for the residuals with lengths 



I?(y|/?)--lgP(e|/3) = -lgP(r|/3), 



(9) 



that is, the negative log-likelihood of the data given the model. This is a standard statistical measure for poorness of fit.'^ 

Consider a stepwise-regression scenario in which our model currently contains q — 1 features (including the intercept term), 
and we're deciding whether to include an extra g"^ feature. Let Yi denote the i"^ entry (row) of Y and Pq a model with all q 
features. (2) and (9) give 

n 

V{Y\Pq)^-lgl[P{Y,\Pq) 



1 



: exp 



(10) 



21n2 



nln(2w2)_^RSS, 



(7^ 



^This is what [Gru05, p. 11] calls the "crude, two-part version of MDL." Beginning with [Ris86], Rissanen introduced a "one-part" version 
of MDL based on a concept called "stochastic complexity." However, it too divides the description length into terms for model fit and model 
complexity, so it is in practice similar to two-part MDL [StiOl]. 

^We use "idealized" code lengths and so drop the ceiling on — Ig P{Y \ /3) [BY98, p. 2746] . Also, since P{Y | /3) is a continuous density, we would 
in practice need to discretize it. This could be done to a given precision with only a constant impact on code length [CT06, sec. 8.3]. 
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where RSSg is as in (5). Of course, is in practice unknown, so we estimate it by^ 



^2 RjSS(7 
a = . 



(11) 



Inserting this estimate for in (10) gives 



1 



21n2 

n 
21n2 



[nln(27rCT^) + 
V n 



1 



(12) 



Unfortunately, in practice, (11) seems to cause overfitting, especially when used for multiple responses as in Section 2.2.2. 
The overfitting comes from the fact that ct^ assumes that the current (f^ feature actually comes into the model; if this is not 
the case, the estimated variance is spuriously too low, which allows random noise to appear more unlikely than it really is. 
To prevent overfitting, we use instead an estimate of based on the model without the current feature: (3q-i- That is. 



n ^ — ' n 



which means that (10) becomes instead 



2 In 2 



In 



/27rRSSg_i 



RSSq 



RSS 
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(13) 



(14) 



1.2.2 Coding the Model: V{(3) 

Just as I?(y I f3) depends on the model for the residuals that Sender and Receiver choose, so their coding scheme for (3 itself 
will reflect their prior expectations.^ When the number of features m is large (say, 1000), Sender will likely only want to 
transmit a few of them that are most relevant, and hence the /3 matrix will contain mostly zeros. ^ So the flrst step in coding 
(3 could be to say where the nonzero entries are located; if only a few features enter the model, this can be done relatively 
efficiently by listing the indices of the features in the set {1,2, . . . ,m}. This requires [Igm] bits, which we idealize as just 
Ig m. 

The second step is to encode the numerical values of those coefficients. Rissanen [Ris83] suggested the basic approach 
for doing this: Create a discrete grid over some possible parameter values, and use a code for integers to specify which grid 
element is closest.^ [Sti04] described a simple way to approximate the value of a particular coefficient /3: Encode an integer 
version of its z-score relative to the null-hypothesis value Pq (which in our case is 0): 



/ p-/3o \ ^ / ^ \ 
\SE0)/ \SE(^)/' 



(15) 



where (x) means "the closest integer to x." The z-score can then be coded with the idealized universal code for the positive 
integers of [Eli75] and [Ris83], in which the cost to code i g {1,2, 3, . . .} is 

Is* « + b, 



* This is the maximum-hkehhood estimate for a^, which Sender uses because, ignoring model-coding cost, maximizing hkehhood is equivalent 
to minimizing description length. In practice, of course, many statisticians use the unbiased estimate 

n — q 

[Sti04, Appendix A] makes an interesting case for this latter value within the MDL framework. However, we use the maximum-likelihood estimate 
for its elegance and convenience. 

^Again by the Kraft inequality, we can interpret 2"-^"'' as a prior over possible models /3. In fact, this is done explicitly in the Minimum 
Message Length (MML) principle (e.g., [Wal05]), a Bayesian sister to MDL which chooses the model /3 with maximum P{f5 \ Y), i.e., the model 
that minimizes 

- Ig P(l3 I y ) = - Ig P(/3) - Ig P(y I /3) + const. 



^For convenience, we assume that Sender always transmits the intercept coefficient /3i and that, in fact, doing so is free of charge. Thus, even 
if Sender were to encode all other feature coefficients as 0, Receiver would at least have the average Y of the Y values to use in reconstructing Y. 

'^Thus, Sender transmits only rounded coefficient values. Rounding adds to the residual coding cost because Sender is not specifying the exact 
MLE, but the increase tends to be less than a bit [Ris89]. 
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where Ig* i := Ig z + Ig Ig z + Ig Ig Ig i + . . . so long as the terms remain positive, and b w Ig 2.865 w 1.516 [Ris83, p. 424] is the 
constant such that 

oo 
i=l 

We require the Ig* instead of a simple Ig because the number of bits Sender uses to convey the integer i will vary, and she 
needs to tell Receiver how many bits to expect. This number of bits is itself an integer that can be coded, hence the iteration 
of logarithms. The middle row of Table 1 shows example costs with this code. 



i 


1 


2 


3 


4 


5 


10 


100 


Universal-code cost for i 


1.5 


2.5 


3.8 


4.5 


5.3 


7.4 


12.9 


Universal-style cost for i, truncated at 1000 


1.2 


2.2 


3.4 


4.2 


5.0 


7.0 


12.6 



Table 1: Example costs of the universal code for the integers. 



In fact, in practice it's unnecessary to allow our integer code to extend to arbitrarily large integers. We're interested in 
features near the limit of detectability, and we expect our z-scores to be roughly in the range 2 to 4, since if they were 
much higher, the true features would be obvious and wouldn't require sensitive feature selection. We could thus impose some 
maximum possible z-score Z that we might ever want to encode (say, 1000) and assume that all of our z-scores will fall below 
it. In this case, the constant c can be reduced to a new value cz, now only being large enough that 

z 

^2-(ig''+=2) ^ 1. (16) 

i=l 

In particular, ciooo ~ 1.199, which is reflected in the reduced costs shown in the third row of Table 1. As it turns out, ~ 1 
over a broad range (say, for Z e {5, . . . , 1000}). 

In our implementation, we avoid computing the actual values of our z-scores (though this could be done in principle), 
instead assuming a constant cost of 2 bits per coefficient. Though MDL theoretically has no tunable parameters once Sender 
and Receiver have decided on their models, the number of bits to specify a coefficient can act as one, having much the same 
effect as the significance level a of a hypothesis test (see Section 3.3 for details). We found 2 bits per coefficient to work well 
in practice. 

1.2.3 Summary 

Combining the cost of the residuals with the cost of the model gives the following formula for the description length as a 
function of the number of features q that we include in the model: 

-lgP(y|^,) + g(lgm + 2). (17) 

Note the similarity between (17) and the RIC penalty for (7). 



2 MIC for Prediction 

Standard multivariate linear regression as described in Section 1 assumes many features but only a single response. However, 
a number of practical problems involve both multiple features and multiple responses. For instance, a biologist may have 
transcript abundances for thousands of genes that she would like to predict using genetic markers [KCY+06], or she may have 
multiple diagnostic categories of cancers whose presence or absence she would like to predict using transcript abundances 
[KWR+01]. [BF97] describe a situation in which they were asked to use over 100 econometric variables to predict stock prices 
in 60 different industries. 

More generally, we suppose we have h response variables y i, ... ,yh to be predicted by the same set of features X2, ■ . . ,Xp 
(with xi being an intercept, as before). In parallel to (2), we can write 

Y = Xf3 + e, (18) 

where now Y is aii n x h matrix with each response along a column. X remains n x m, while (3 \s m x h and e is n x ft,. 
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In general, the noise on the responses may be correlated; for instance, if our responses consist of temperature measurements 
at various locations, taken with the same thermometer, then if our instrument drifted too high at one location, it will have 
been too high at the other. A second, practical reason we might make this assumption comes from our use of stepwise 
regression to choose our model: In the early iterations of a stepwise algorithm, the effects of features not present in the model 
show up as part of the "noise" error term, and if two responses share a feature not yet in the model, the portion of the error 
term due to that feature will be the same. Thus, we may decide to take the rows of e have nonzero covariance: 

e,^Mh{0,^), (19) 

where is the ith row of e, and E is an arbitrary (not necessarily diagonal) h x h covariance matrix. In practice, however, 
we end up using a diagonal estimate of S, as discussed in Section 2.2.2. 

2.1 Multitask Learning 

The naive approach to regression with the above model is to treat each response as a separate regression problem and 
calculate each column of /? using just the corresponding column of Y. This is completely valid mathematically; it even 
works in the case where the covariance matrix S is non-diagonal, because the maximum-likelihood solution for the mean of 
a multivariate Gaussian is independent of the covariance [Bis06, p. 147]. However, intuitively, if the regression problems are 
related, we ought to be able to do better by "borrowing strength" across the multiple responses. This concept goes by various 
names; in the machine-learning community, it's often known as "multitask learning" (e.g., [Car97]) or "transfer learning" 
(e.g., [RNK06]). 

Several familiar machine-learning algorithms naturally do multitask learning; for instance, neural networks can share 
hidden-layer weights with multiple responses. In addition, a number of explicit multitask regression methods have been 
proposed, such as curds and whey, which takes advantage of correlation among responses [BF97], and various dimensionality- 
reduction algorithms (e.g., [AM05]). 

Some techniques even do explicit feature selection. For instance, [BVF98, BVF02] present Bayesian Markov Chain Monte 
Carlo approaches for large numbers of features. [ST05] present a Multiresponse Sparse Regression (MRSR) algorithm that 
selects features in a stepwise manner by choosing at each iteration the feature most correlated with the residuals of the 
responses from the previous model. ^ We have not explored the above algorithms in detail, but comparing them against MIC 
could be a fruitful direction for future work. Below we elaborate on two multitask-learning approaches that we have used. 

2.1.1 AndoZhang 

Ando and Zhang [AZ05] aim to use multiple prediction problems to learn an underlying shared structural parameter on the 
input space. Their presentation is general, but they give the particular example of a linear model, which we'll rewrite using 
the notation of this thesis. Given h tasks and m features per task, the authors assume the existence of an if x m matrix*^ Q, 
with H < m, such that for each task k — I, . . . ,h, 

yk ^ x/3'' + xe'p^, 

where superscript k denotes the fc*'^ column and (3q is an _ff x /i matrix of weights for the "lower dimensional features" created 
by the product XQ' (p. 1824). The transformation O is common across tasks and is thus the way we share information. The 
authors develop an algorithm, referred to as AndoZhang in this thesis, that chooses (p. 1825) 



,el = argmin V f - llr*^ - X (/J*^ + e'/3?-,)f + Xk \\f3^\f'] such that 66' = Ihxh- 



Here {Afcj^'^j^ are regularization weights, and (unlike Ando and Zhang) we've assumed a constant number n of observations 
for each task, as well as a quadratic loss function. ||-||^ denotes the square of the I2 norm. 

Ando and Zhang present their algorithm as a tool for semi-supervised learning: Given one particular problem to be solved, 
it's possible to generate auxiliary problems from the known data, solve them, and then use the shared structural parameter 
6 on the original problem. Of course, the approach works just as well when we start out with all of the tasks as problems 
to be solved in their own right. 



"Another sparse-regression algorithm is BBLasso, described in Section 2.1.2 below. 

^The authors use h instead of H, but we have already used that variable to denote the number of tasks. 
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2.1.2 BBLasso 



A few multitask learning methods actually do feature selection, i.e., building models that contain only a subset of the original 
features. AndoZhang, with its ^2 regularization penalty, leaves in most or all of the features, but £i methods, such as BBLasso 
presented below, do not. 

[AEP08] present an algorithm for learning functions of input features across tasks. This is more general than the approach 
considered in this thesis of choosing merely a subset of the original features, but [AEP08, sec. 2.2] note that their approach 
reduces to feature selection with an identity function. Regularizing with an ii norm over features, their problem thus becomes 



where RSS/3 — trace [{Y — Xf3y{Y — X(3)] is the residual sum of squares for all h responses, and Pi denotes the i^^ row of /3. 
The i*"^ term in the sum represents the magnitude of coefficients assigned to feature i; the sum over these values for each i 
amounts to an ii norm over the rows. 

The £2 norm favors shared coefficients across responses for the same features. To see this, [AEP08, p. 6] suggest the case 
in which the entries of (3 can only be or 1. li h different features each have a single nonzero response coefficient, the penalty 
will be ^^j^i 1 = A/i, since \\(3i\\ = 1 for each such feature i. However, if a single feature shares a coefficient of 1 across all 

h responses, the penalty is only A-^^^^-^ 1^ = X\/h. The same number and magnitude of nonzero coefficients thus leads to 
a much smaller penalty when the coefficients are shared. 

In principle the penalty term can be made even smaller by including nonzero coefficients for only some of the responses. 
For instance, in the example above, the penalty would be \/h/2 if just half of the responses had nonzero coefficients (i.e., 
coefficients of 1). However, in reality, the coefficient values are not restricted to and 1, meaning that unhelpful coefficients, 
rather than being set to 0, tend to get very small values whose square is negligible in the ^2-norm penalty. In practice, then, 
this algorithm adds entire rows of nonzero coefficients to (3. 

Of course, as [OTJ09, p. 4] note, the £2 norm used here could be generalized to an £p norm for any p G [l,cx)). p — 1 
corresponds to "no sharing" across the responses, since (20) would then reduce to the objective function for h independent 
£i-penalized regressions. At the opposite extreme, p = 00 corresponds to "maximal sharing" ; in that case, only the maximum 
absolute coefficient value across a row matters. 

A number of efficient optimization approaches have been proposed for (20) with p = 2 and p — 00; see, e.g., the related- 
work sections of [TVW05] and [ST07] for a survey. In particular, [OTJ09] propose a fast approximate algorithm using a 
regularization path, that is, a method which implicitly evaluates (20) at all values of A. 

2.2 Regression by MIC 

Following Section 1.2, this section considers ways we might approach regression with multiple responses from the perspective 
of information theory. One strategy, of course, would be to apply the penalized-likelihood criterion (17) to each response in 
isolation; we'll call this the RIC method, because the dominant penalty in (17) is Igm, the same value prescribed by the 
Risk Inflation Criterion for equation (7). 

However, following the intuition of BBLasso and other multitask algorithms, we suspect that features predictive of one 
task are more likely predictive of other, related tasks. For example, if two of our responses are "level of HDL cholesterol" and 
"level of LDL cholesterol," we expect that lifestyle variables correlated with one will tend to be correlated with the other. 
The following multitask coding scheme, called the Multiple Inclusion Criterion (MIC) , is designed to take advantage of such 
situations by making it easier to add to our models features that are shared across responses. 

2.2.1 Coding the Model 

The way MIC borrows strength across responses is to efficiently specify the feature-response pairs in the mx h matrix /3. The 
naive approach would be to put each of the mh coefficients in a linear order and specify the index of the desired coefficient 
using lg{mh) bits. But we can do better. If we expect nearly all the responses to be correlated with the predictive features, 
we could give all of the responses nonzero coefficients (using 2h bits to code the values for each of the h response coefficients) 
and simply specify the feature that we're talking about using Igm bits, as in Section 1.2.2. We'll call this the fully dependent 
MIC coding scheme (or Full MIC for short). It amounts to adding entire rows of nonzero coefficients at a time to our (3 
matrix, in much the same way that BBLasso does. 

In many cases, however, the assumption that each feature is correlated with almost all of the responses is unrealistic. 
A more flexible coding scheme would allow us to specify only a subset of the responses to which we want to give nonzero 
coefficients. For instance, suppose we're considering feature number 703; of the ft, = 20 responses, we think that only responses 




(20) 
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{2, 5, 11, 19} should have nonzero coefRcients with the current feature. We can use Igm bits to specify our feature (number 
703) once, and then we can hst the particular responses that have nonzero coefficients with feature 703, thereby avoiding 
four different lg{mh) penalties to specify each coefficient in isolation. 

A standard way to code a subset of a set of size h is to first specify how many k < h elements the subset contains and 
then which of the ('^) possible subsets with k elements we're referring to [CT06, sec. 7.2]. In particular, we choose to code k 
using 

Ig* k + ch (21) 

bits, with Ch as defined in (16).^" We then need Ig (^) additional bits to specify the particular subset. We refer to this code 
as partially dependent MIC (or Partial MIC for short). 



2.2.2 Coding the Data 



As usual, T){Y \ (3q) — — \gP{Y \ (3q) for the model that Sender and Receiver choose, which in this case is (15 
rows of e to have arbitrary covariance as in (19), then 



21n2 



n I 

tin {{2nf\n) + Y.[^-- - 



If we allow 



(22) 



with subscript i denoting the i^^ row. Since E is in fact unknown, we estimate it using maximum likelihood: 



(23) 



where the subscript F stands for "full covariance," and we use /3<j-i instead of /3<j to prevent overfitting, as in the single- 
response case of Section 1.2.1. 

In practice, we find that estimating all entries of S leads to overfitting. Therefore, in our experiments, we estimate S 
using a diagonal matrix S/j that's the same as Si? except that the off-diagonal entries have been set to 0. In this case, (22) 
reduces to a sum of h terms like (14) for each response separately. We also experimented with shrunken estimates of the form 
E> = AEd + (1 — X)'Sp for A e [0, 1] and observed informally that for A > 0.5 or so, S> performed comparably with E^). 
However, there did not appear to be a performance advantage, so we continued to use S/^, which is faster computationally.^^ 



2.2.3 Comparing the Coding Schemes 

This section has discussed three information-theoretic approaches to multitask regression: RIC, Full MIC, and Partial MIC. 
In general, the negative log-likelihood portion of RIC may differ from that of the other two, because Full and Partial MIC 
may use a nondiagonal covariance estimate like "Spj while RIC, only operating on one response at a time, implicitly uses 
However, since we also use S^j for Full and Partial MIC, the real differences come from the coding penalties. 

These are compared in Table 2 for various values of fc, the number of responses for which we add the current feature 
under consideration. Of course, Full MIC is only allowed to take fc = or fc = /i, so it actually has h nonzero coefficients in 
all three rows of the table. However, if the extra h — k coefficients correspond to non-predictive features, the extra reduction 
in residual-coding cost that Full MIC enjoys over the other methods is likely to be small. 

The numbers beside the coding-cost formulas illustrate the case m = 2,000 features and h = 20 responses. As expected, 
each coding scheme is cheapest in the case for which it was designed; however, we note that the MIC methods are never 
excessively expensive, unlike RIC for k — h. 



2.3 Implementation 

As mentioned in Section 1.1, searching over all possible combinations of zero and nonzero coefficients for the model that 
minimizes description length is computationally intractable. We therefore implement MIC using a forward stepwise search 
procedure, detailed in Algorithm 2.1. Beginning with a null model containing only an intercept feature, we evaluate each 
feature for addition to this model and add the one that would most reduce description length, as computed using Algorithm 

^"We could also assume a uniform distribution on {1,2, . . . , h} and spend Ig h bits to code fc's index. However, in practice we find that smaller 
values of k occur more often, so that the lg*-based code is generally cheaper. 

^^We also experimented with two other regularized full covariance-matrix estimators, proposed in [LWOl] and [SORS], respectively. While we 
never ran systematic comparisons, our informal assessment was that these methods generally did not improve performance relative to our current 
approach and sometimes reduced it. 
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Table 2: Costs in bits for each of the three schemes to code the appearance of a feature in fc = 1, fc = |, and k — h response 
models. In general, we assume m ^ /i ^ 1. Note that for h S {5, . . . , 1000}, Ch ~ 1- Examples of these values for m = 2,000 
and h — 20 appear in brackets; the smallest of the costs appears in bold. 



k 


Partial MIC 




Full MIC 




RIC 




1 


Ig m + Ch + lgh + 2 


[18.4] 


lgm + 2h [51.0] 


Ig m - 


-2 


[13.0] 


h 
4 


lgm + lg*(|)+c,+lg(4J- 


f 1 [39.8] 


lgm + 2h [51.0] 


f Igm 


. h 
^ 2 


[64.8] 


h 


lgm + \g* h + Ch + 2h 


[59.7] 


lgm + 2h [51.0] 


hlgm- 


V2h 


[259.3] 



2.2. Updating our model, we re-evaluate each remaining feature and add to the model the next-best feature. We continue 
until we find no more features that would reduce description length if added to the model. 



Algorithm 2.1: StepwiseRegressionMIC(X, F, method) 

/ / "method" is either "partially dependent MIC" or "fully dependent MIC." 
Make sure that the matrix X contains a column of I's for the intercept. 
Initialize the model [3 to have nonzero coefficients only for the intercept terms. That is, if [3 

is an m X h matrix, exactly the first row has nonzero elements. 
Compute E based on the residuals Y — X(3 (whether as T,p, E^), or something in between), 
while true 

'Find the feature / that, if added, would most reduce description length. 

(If method is "fully dependent MIC," then "adding a feature" means making 
nonzero the coefficients for all of the responses with that feature. If method is "partially 
dependent MIC," then it means making nonzero the optimal subset of responses, 
where the "optimal subset" is approximated by a stepwise-style search 
^ through response subsets.) 
Let (3f denote the model that would result if this new feature were added, 
if DescrLength(X, Y, (3f, E, method) < DescrLength(X, Y, (3, E, method) 

then P^^/^ 
[Update E. 

else break 

return (/?) 



Algorithm 2.2: DescrLength(X, F, /?, E, method) 

Let m be the number of features (columns of X) and h the number of responses (columns of Y). 
d ^ II Initialize d, the description length, 
d ^ d + V{Y 1/3) // Add the fikefihood cost using (22). 
for j ^ 1 to m // Add the cost of feature j, if any. 
' k <^ number of nonzero responses for feature j 
if /c> 

'd^d-flgm // Cost to specify which feature. 
(i^(i + 2fc // Cost to specify nonzero coefficients, 
if method == "partially dependent MIC" 
then d ^ d -I- Ig* fc + c/i + Ig (^) // Specify which subset of responses. 



do < 



then 



return (d) 



The above procedure requires evaluating the quality of features 0(mr) times, where r is the number of features eventually 
added to the model. For Partial MIC, each time we evaluate a feature, we have to determine the best subset of responses 
that might be associated with that feature. As Algorithm 2.1 notes, this is also done using a stepwise-style search: Start with 
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no responses, add the best response, then re-evaluate the remaining responses and add the next-best response, and so on.^^ 
Unlike an ordinary stepwise algorithm, however, we don't terminate the search if, at the current number of responses in the 
model, the description length fails to be the best seen yet. Because we're interested in borrowing strength across responses, 
we need to avoid overlooking cases where the correlation of a feature with any single response is insufficiently strong to 
warrant addition, yet the correlations with all of the responses are. Moreover, the Ig (^) term in Partial MIC's coding cost 
does not increase monotonically with k, so even if adding the feature to an intermediate number of response models doesn't 
look promising, adding it to all of them might. Thus, we perform a full 0{h?) search in which we eventually add all the 
responses to the model. As a result. Partial MIC requires a total of 0{mrh?) evaluations of description length. 
However, a few optimizations can reduce the computational cost of Partial MIC in practice. 

• We can quickly filter out most of the irrelevant features at each iteration by evaluating, for each feature, the decrease 
in negative log-likelihood that would result from simply adding it with all of its responses, without doing any subset 
search. Then we keep only the top t features according to this criterion, on which we proceed to do the full 0{h'^) 
search over subsets. We use t = 75, but we find that as long as i is bigger than, say, 10 or 20, it makes essentially no 
impact to the quality of results. This reduces the number of model evaluations to 0{rar + rth?). 

• We can often short-circuit the 0{h?) search over response subsets by noting that a model with more nonzero coefficients 
always has lower negative log- likelihood than one with fewer nonzero coefficients. This allows us to get a lower bound 
on the description length for the current feature for each number k G { 1 , . . . , ft,} of nonzero responses that we might 
choose as 

(model cost for other features already in model) 

-I- (negative log-likelihood of Y if all h responses for this feature were nonzero) (24) 
+ (the increase in model cost if k of the responses were nonzero) . 

We then need only check those values of k for which (24) is smaller than the best description length for any candidate 
feature's best response subset seen so far. In practice, with h — 20, we find that evaluating k up to, say, 3 or 6 is 
usually enough; i.e., we typically only need to add 3 to 6 responses in a stepwise manner before stopping, with a cost 
of only "ih to 6ft. 

Although we did not attempt to do so, it may be possible to formulate MIC using a regularization path, or homotopy, 
algorithm of the sort that has become popular for performing ii regularization without the need for cross-validation (e.g., 
[Fri08]). If possible, this would be significantly faster than stepwise search. 

2.4 Experiments 

We evaluate MIC on several synthetic and real data sets in which multiple responses are associated with the same set of 
features. We focus on the parameter regimes for which MIC was designed: Namely, n, but with only a relatively small 

number of features expected to be predictive. We describe the details of each data set in its own subsection below. 

For comparing MIC against existing multitask methods, we used the Matlab "Transfer Learning Toolkit" [KR], which 
provides implementations of seven algorithms from the literature. Unfortunately, most of them did not apply to our data 
sets, since they often required meta-features (features describing other features), or expected the features to be frequency 
counts, or were unsupervised learners. The two methods that did apply were AndoZhang and BBLasso, both described in 
Section 2.1. 

The AndoZhang implementation was written by the TL Toolkit authors, Wei-Chun Kao and Alexander Rakhlin. It 
included several free parameters, including the regularization coefficients Xk for each task (all set to 1, as was the default), 
the optimization method (also set to the default), and H. [AZ05, pp. 1838-39] tried values of H between 10 and 100, 
settling on 50 but finding that the exact value was generally not important. We performed informal cross-validation for 
values between 1 and 250 and found, perhaps surprisingly, that H = 1 consistently gave the best results. We used this value 
throughout the experiments below. 

The BBLasso implementation was written by Guillaume Obozinski, based on a paper [OTJ06] published earlier than the 
[OTJ09] cited above; however, the algorithm is essentially the same. For each parameter, we used the default package setting. 

AndoZhang and BBLasso are both classification methods, so in order to compare against them, we had to turn MIC into 
a classification method as well. One way would have been to update (22) to reflect a logistic model; however, the resulting 

^■^A stepwise search that rc-evaluates the quahty of each response at each iteration is necessary because, if we take the covariance matrix S to 
be nondiagonal, the values of the residuals for one response may affect the likelihood of residuals for other responses. If we take E to be diagonal, 
as we end up doing in practice, then an 0{h) search through the responses without re-evaluation would suffice. 

^^FuU MIC, not requiring the search through subsets of responses, is only 0{mr) in the number of evaluations of description length. 

^*If S is diagonal and we don't need to re-evaluate residual likelihoods at each iteration, the cost is only 3 to 6 evaluations of description length. 
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computation of (3 would then have been nonlinear and much slower. Instead, we continue to apply (22) and simply regress 
on responses that have the values or 1. Once we've chosen which features will enter which models, we do a final round of 
logistic regression for each response separately with the chosen features to get a slightly better classifier. 

2.4.1 Synthetic Data 

We created synthetic data according to three separate scenarios, which we'll call "Partial," "Full," and "Independent." For 
each scenario, we generated a matrix of continuous responses as 

with m — 2,000 features, h = 20 responses, and n = 100 observations. Then, to produce binary responses, we set to 1 those 
response values that were greater than or equal to the average value for their columns and set to the rest, yielding a roughly 
50-50 split between I's and O's because of the normality of the data. Each entry of Xgim was i.i.d. Af{0, 1), each nonzero 
entry of /3sim was i.i.d. Af{0, 1), and entry of esim was i.i.d. A/'(0,0.1), with no covariance among the Csim entries for different 
responses. Each response had 4 beneficial features, i.e., each column of /3sim had 4 nonzero entries. 
The scenarios differed according to the distribution of the beneficial features in /3sim- 

• In the Partial scenario, the first feature was shared across all 20 responses, the second was shared across the first 15 
responses, the third across the first 10 responses, and the fourth across the first 5 responses. Because each response 
had four features, those responses (6 — 20) that didn't have all of the first four features had other features randomly 
distributed among the remaining features (5, 6, . . . , 2000). 

• In the Full scenario, each response shared exactly features 1 — 4, with none of features 5 — 2000 being part of the model. 

• In the Independent scenario, each response had four random features among {1, . . . ,2000}. 
Figure 1 illustrates these feature distributions, showing the first 40 rows of random /3si,ji matrices. 



Partial Scenario Full Scenario Independent Scenario 




20 responses 20 responses 20 responses 



Figure 1: Examples of Psim for the three scenarios, with the nonzero coefficients in black. The figures show all 20 columns of 
/3sim but only the first 40 rows. (For the Partial and Independent scenarios, the number of nonzero coefficients appearing in 
the first 40 rows is exaggerated above what would be expected by chance for illustration purposes.) 

Table 3 shows the performance of each of the methods on five random instances of these data sets. Test-set errors for the 
"True Model" are those obtained with logistic regression for each response separately using a model containing exactly the 
features of /3sim that had nonzero coefficients. In addition to test-set error, we show precision and recall metrics. Coefficient- 
level precision and recall refer to individual coefficients: For those coefficients of the data-generating f3sini that were nonzero, 
did our final /3 show them as nonzero, and vice versa? Feature-level precision and recall look at the same question for entire 
features: If a row of Psim had any nonzero coefficients, did our corresponding row of /3 have any nonzero coefficients (not 
necessarily the same ones), and vice versa? 

The best test-set error values are bolded (though not necessarily statistically significant). As we would expect, each of 
Partial MIC, Full MIC, and RIC is the winner in the synthetic-data regime for which it was designed, although Partial MIC 

^^We mentioned in Section 2.2.2 that MIC performed best when we used S^i, the diagonal covariance-matrix estimate. One might wonder 
whether this was due only to the fact that we created our synthetic test data without covariance among the responses. However, this was not the 
case. When we generated synthetic noise using a correlation matrix in which each off-diagonal entry was a specific nonzero value (in particular, 
wc tried 0.4 and 0.8), the same trend appeared: MIC with E/j had slightly lower test error. 
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Table 3: Test-set accuracy, precision, and recall of MIC and other methods on 5 instances of the synthetic data sets generated 
as described in Section 2.4.1. Because synthetic data is cheap and our algorithms, once trained, are fast to evaluate, we used 
10,000 test data points for each data-set instance. Standard errors are reported over each task; that is, with 5 data sets and 
20 tasks per data set, the standard errors represent the sample standard deviation of 100 values divided by \/T00 (except 
for feature-level results, which apply only to entire data sets and so are divided by Vb). Baseline accuracy, corresponding to 
guessing the majority category, is roughly 0.5. AndoZhang's NA values are due to the fact that it does not explicitly select 
features, 



Method 


True Model 


Partial MIC 


Full MIC 


RIC 


AndoZhang 


BBLasso 


Partial Synthetic Data Set 


Test error 


0.07 ±0.00 


0.10 ± 0.00 


0.17 ±0.01 


0.12 ±0.01 


0.50 ±0.02 


0.19 ±0.01 


Coeff. precision 


1.00 ±0.00 


0.84 ±0.02 


0.26 ±0.01 


0.84 ±0.02 


NA 


0.04 ±0.00 


Coeff. recall 


1.00 ±0.00 


0.77 ±0.02 


0.71 ±0.03 


0.56 ±0.02 


NA 


0.81 ±0.02 


Feature precision 


1.00 ±0.00 


0.99 ±0.01 


0.97 ±0.02 


0.72 ±0.05 


NA 


0.20 ±0.03 


Feature recall 


1.00 ±0.00 


0.54 ±0.05 


0.32 ±0.03 


0.62 ±0.04 


NA 


0.54 ±0.01 


Full Synthetic Data Set 


Test error 


0.07 ±0.00 


0.08 ± 0.00 


0.08 ± 0.00 


0.11 ±0.01 


0.45 ±0.02 


0.09 ±0.00 


Coeff. precision 


1.00 ±0.00 


0.98 ±0.01 


0.80 ±0.00 


0.86 ±0.02 


NA 


0.33 ±0.03 


Coeff. recall 


1.00 ±0.00 


1.00 ±0.00 


1.00 ±0.00 


0.63 ±0.02 


NA 


1.00 ±0.00 


Feature precision 


1.00 ±0.00 


0.80 ±0.00 


0.80 ±0.00 


0.36 ±0.06 


NA 


0.33 ±0.17 


Feature recall 


1.00 ±0.00 


1.00 ±0.00 


1.00 ±0.00 


1.00 ±0.00 


NA 


1.00 ±0.00 


Independent Synthetic Data Set 


Test error 


0.07 ±0.00 


0.17 ±0.01 


0.36 ±0.01 


0.13 ± 0.01 


0.49 ±0.00 


0.35 ±0.01 


Coeff. precision 


1.00 ±0.00 


0.95 ±0.01 


0.06 ±0.01 


0.84 ±0.02 


NA 


0.02 ±0.00 


Coeff. recall 


1.00 ±0.00 


0.44 ±0.02 


0.15 ±0.02 


0.58 ±0.02 


NA 


0.43 ±0.02 


Feature precision 


1.00 ±0.00 


1.00 ±0.00 


1.00 ±0.00 


0.83 ±0.02 


NA 


0.30 ±0.05 


Feature recall 


1.00 ±0.00 


0.44 ±0.02 


0.14 ±0.02 


0.58 ±0.03 


NA 


0.42 ± 0.06 



competes with Full MIC even in the Full regime. Though the results are not shown, we observed informally that AndoZhang 
and BBLasso tended to have larger differences between training and testing error than MIC, implying that MIC may be less 
likely to overfit. Resistance to overfitting is a general property of MDL approaches, because the requirement of completely 
coding P constrains model complexity. This point can also be seen from the high precision of the information-theoretic 
methods, which, surprisingly, show comparable recall to BBLasso. Full MIC behaves most like BBLasso, because both 
algorithms select entire rows of coefficients for addition to /3. 

2.4.2 Yeast Growth 

Our first real data set comes from [LCCP09, pp. 5-6]. It consists of real- valued growth measurements of 104 strains of yeast 
(n = 104 observations) under 313 drug conditions. In order to make computations faster, we hierarchically clustered these 
313 conditions into 20 groups using single-link clustering with correlation as the similarity measure. Taking the average of 
the values in each cluster produced ft, = 20 real-valued responses, which we then binarized into two categories: values at least 
as big as the average for that response (set to 1) and values below the average (set to 0).^^ The features consisted of 526 
markers (binary values indicating major or minor allele) and 6,189 transcript levels in rich media for a total of m = 6,715 
features. 

The "Yeast Growth" section of Table 4 shows test errors from 5-fold CV on this data set. Though the difference isn't 
statistically significant. Partial MIC appears to outperform BBLasso on test error. In any event. Partial MIC produces a 
much sparser model, as can be seen from the numbers of nonzero coefficients and features: Partial MIC includes nonzero 
coefficients in an average of 4 rows of (3, in contrast to 63 for BBLasso. Partial MIC also appears to outperform Full MIC 
and RIC, presumably because the assumptions of complete sharing or no sharing of features across responses rarely hold 
for real-world data. Like Partial MIC, AndoZhang did well on this data set; however, the algorithm scales poorly to large 
numbers of responses and so took 39 days to run.^' 

^^Thc split was not exactly 50-50, as the data were sometimes skewed, with the mean falling above or below the median. Table 4 reflects this 
fact, showing that a classifier that simply guesses the majority label achieves an average error rate of 0.41. 

^^Much of the reason for this is that AndoZhang, as currently implemented, only produces predictions for one of the responses after it trains on 



12 



Table 4: Accuracy and number of coefficients and features selected on five folds of CV for the Yeast Growth, Yeast Markers, 
and Breast Cancer data sets. Standard errors are over the five CV folds; i.e., they represent (sample standard deviation) 
/ a/5- The "Majority Label" column represents a classifier which guesses the more common of the 0/1 labels seen in the 
training set. AndoZhang's NA values are due to the fact that it does not explicitly select features. 



Method 


Partial MIC 


Full MIC 


RIG 


AndoZhang 


BBLasso 


Majority Label 


Yeast Growth 


Test error 


0.38 ± 0.04 


0.39 ±0.04 


0.41 ±0.05 


0.39 ±0.03 


0.43 ±0.03 


0.41 ±0.04 


Num. coeff. sel. 


22 ±4 


64 ±4 


9± 1 


NA 


1268 ± 279 


NA 


Num. feat. sel. 


4±0 


3±0 


9± 1 


NA 


63 ±14 


NA 


Yeast Markers 


Test Error 


0.34 ±0.07 


0.44 ±0.04 


0.27 ± 0.06 




0.45 ±0.05 


0.53 ±0.03 


Num. coeff. sel. 


20 ± 1 


68 ±26 


37 ±2 


NA 


1044 ± 355 


NA 


Num. feat. sel. 


16 ± 1 


3± 1 


37 ±2 


NA 


52± 18 


NA 


Breast Cancer 


Test error 


0.33 ± 0.08 


0.37 ±0.08 


0.36 ±0.08 


0.44 ±0.03 


0.33 ± 0.08 


0.39 ±0.04 


Num. coeff. sel. 


3±0 


11 ± 1 


2±0 


NA 


61 ± 19 


NA 


Num. feat. sel. 


2±0 


2±0 


2±0 


NA 


12 ±4 


NA 



2.4.3 Yeast Markers 

The Yeast Growth data set described above includes as features both markers and transcripts for 104 yeast strains. We can 
consider these as variables for prediction in their own right, without reference to the growth responses at all. In fact, regression 
of transcripts against markers is commonly done; it's known as "expression quantitative trait loci" (eQTL) mapping [KW06]. 
Since the marker variables are already binary (major or minor allele at the given location), we decided to flip the problem 
around from the usual eQTL setup: Using the m = 6,189 transcripts as features, we predicted a subset of 20 of the 526 
markers (numbered 25, 50, 75, . . . , 500 for a good distribution along the genome). 

The results are shown in the "Yeast Markers" section of Table 4. Unlike the case of the Yeast Growth data, there is 
apparently less sharing of feature information across markers, as RIG appears to outperform Partial MIC on test error. We 
did not run AndoZhang on this data set. 

2.4.4 Breast Cancer 

Our Breast Cancer data set represents a combination of five of the seven data sets used in [vtVDvdV+02]. It contains 
1,171 observations for 22,268 RMA-normalized gene-expression values. We considered five associated responses; two were 
binary — prognosis ("good" or "poor") and ER status ("positive" or "negative") — and three were not — age (in years), tumor 
size (in mm), and grade (1, 2, or 3). We binarized the three non-binary responses into two categories: Response values at 
least as high as the average, and values below the average. Some of the responses were unavailable for some observations, so 
we eliminated those observations, leaving 882. Of those, we kept only the first n = 100, both to save computational resources 
and to make the problem "harder." To reduce the features to a manageable number, we took the m — 5,000 that had highest 
variance. 

Table 4 shows the results. In this case, BBLasso did as well as Partial MIC on test error; however, as usual. Partial MIC 
produced a much sparser model. Sparsity is important for biologists who want to interpret the selected features. 

3 Hypothesis Testing with a Single Response 

As the experimental results in Section 2.4 demonstrate, regression can be an effective tool for predicting one or more responses 
from features, and in cases where the number of features is large, selection can improve generalization performance. However, 
as we noted with the Yeast and Breast Cancer data sets, accuracy is not always the only goal; sometimes model interpretability 
is important. Indeed, in science, the entire goal of regression is often not prediction but rather discovering a "true model," 
by testing hypotheses about whether pairs of variables really have a linear relationship. For instance, an econometrician 

all of them. Thus, in order to predict all 20 responses, we had to run it separately 20 times. It is probably possible to train once and then predict 
all 20 responses simultaneously, but we wanted to avoid introducing errors by changing the code. 
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regressing housing prices on air-pollution levels and resident income [HR78] is not trying to predict price levels but to study 
consumer demand behavior. 

Hypothesis testing with a small number of features is straightforward: Given a regression coefficient, we can compute a 
t statistic, and if it exceeds some threshold, we reject the null hypothesis that the coefficient is 0. The situation becomes 
slightly more complicated when we have many features whose coefficients we want to examine. Section 3.1 reviews some 
standard approaches to this multiple-testing problem, and Section 3.2 recasts them in the light of MDL. This sets the stage 
for an MIC approach to hypothesis testing in Section 4. 

3.1 Multiple- Testing Procedures 

Consider the case of a single response variable y (say, a gene-transcript level) , and suppose we're trying to determine which 
of m features (say, genetic markers) are linearly correlated with the response. If we test each feature at a significance level 
a, the overall probability that we'll falsely reject a true null hypothesis will be much greater than a — this is the problem of 
multiple hypothesis testing. 

3.1.1 Bonferroni Correction 

A standard way to control against so-called alpha inflation is called the Bonferroni correction: Letting Hi, . . . , H,n denote 
the null hypotheses and pi, . . . ,pm the associated p- values, we reject Hj when pj < j^. By Boole's inequality, this controls 
what is known as the family-wise error rate (FWER), the probability of making any false rejection, at level a under the 
complete null hypothesis that all of Hi, . . . , H^ are true.^* In fact, Bonferroni controls FWER in a strong sense as well: For 
any subset of null hypotheses, the probability of falsely rejecting any member of that subset when all members are true is 
also bounded by a [Hoc88, p. 800]. 

The Bonferroni correction is a single-stage testing procedure in which, unfortunately, testing a larger number of hypotheses 
reduces the power for rejecting any one of them [Sha95, p. 569]. This is especially problematic when the number of hypotheses 
tested is in the thousands or tens of thousands, as is common with, e.g., microarray or fMRI data. However, statisticians have 
developed several multistage testing procedures that help to overcome this limitation by conditioning rejection thresholds for 
some test statistics on the results of rejections by others. 

3.1.2 Improvements on Bonferroni 

One of the ffist of these was the Holm step-down procedure [Hol79]. Here, we let P(i), ■ ■ ■ ,P{m) denote the p- values sorted in 
increasing order and . ■ • ,^f(m) the corresponding null hypotheses. We begin by looking at p(i): If it fails to be < ^, 
we stop without rejecting anything. (This is what Bonferroni would do as well, since no p-value is < ^.) However, if we 
do reject we move on to if(2) and reject if and only if P(2) < . We continue in this manner, rejecting 7?^) when 

P{j) ^ m-j+i ' until we fail to reject a hypothesis. Like Bonferroni, the Holm method controls FWER in the strong sense for 
independent or dependent test statistics [Hoc88, p. 800]. 

Simes [Sim86, p. 751] proposed a more lenient approach: Reject the complete null hypothesis if any of the following 
inequalities holds, j = 1, 2, . . . , m: 



This controls FWER for independent test statistics at level a. Moreover, simulation studies suggested that it remained true 
for various multivariate-normal and multivariate-gamma test statistics [Sini86, p. 752]. Unfortunately, unlike the Bonferroni 
and Holm procedures, the Simes approach says nothing about rejecting individual hypotheses once the complete null is 
rejected [Sim86, p. 754], although [Hoc88] and [Hom88] subsequently proposed limited procedures for doing so. 

3.1.3 FDR and the BH Procedure 

In his closing discussion, Simes [Sim86, p. 754] proposed that if we wanted to reject individual null hypotheses, we might 
reject the ordered hypotheses -ff(i), . ■ ■ , -^(q) such that 



m < 



m 




(25) 



^^When the test statistics are independent or positive orthant dependent, one can replace ^ by 1 — (1 — a) ™ , but the resulting improvement in 
power is small for small a [Sha95, p. 570]. 
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However, he cautioned that there was "no formal basis" for this. Indeed, [Honi88, p. 384] showed that in certain cases, the 
FWER of this procedure could be made arbitrarily close to 1 for large m. 

However, Benjamini and Hochberg [BH95] pointed out that the Simes procedure could be said to control a different 
measure, which they called the false-discovery rate (FDR). Letting V denote the (unobservable) random variable for the 
number of true null hypotheses rejected and S the (unobservable) random variable for the number of correctly rejected null 
hypotheses, 

V + S>0 

(where V + S, the number of null hypotheses rejected in total, is observable). Thus, FDR is the expected proportion of falsely 
rejected null hypotheses. This statistic is more flexible than FWER because it accounts for the total number of hypotheses 
considered; if, for instance, we had m = 1,000,000 hypotheses, the FWER would be very high, but the proportion of false 
rejections could still be low. Because of this flexibility for large to, FDR has become standard in bioinformatics, neuroscience, 
and many other flelds. 

[BH95, p. 293] showed that the Benjamini-Hochberg (BH) step-up procedure (25) controls FDR at a for independent test 
statistics and any conflguration of false null hypotheses. [BYOl] extended this result to show that the procedure also controlled 
FDR for certain types of positive dependency. This included positively correlated and normally distributed one-sided test 
statistics, which occur often in, for instance, gene-expression measurements [RYB03, p. 370]. 

Various extensions to the BH procedure have been proposed. For instance, [BL99] suggested a step-down approach for 
independent test statistics that, while not dominating the BH step-up procedure, tended experimentally to yield higher power. 
[BYOl, p. 1169] showed that replacing a by a/X^jli j would control FDR at a for any test-statistic correlation structure, 
though this is often more conservative than necessary (p. 1183). [YB99] proposed a resampling approach for estimating FDR 
distributions based the data, in order to gain increased power when the test statistics were highly correlated. Many further 
modifications to the BH procedure have been put forward since. However, in this thesis, we stick with the original version 
given by (25). 

3.2 Hypothesis Testing by MDL 

MDL compares log-likelihoods of data under various models, choosing a more complicated model if and only if it has 
sufficiently lower negative log-likelihood. Appendix A shows that, in fact, this process is equivalent to a standard statistical 
procedure known as a generalized likelihood-ratio test, at some implied significance level a. 

Hypothesis testing differs slightly from regression, however. The goal with regression is to minimize test-set error by 
choosing a few highly informative features. Because MIC penalizes for each feature, if we find several, very similar features 
that are all correlated with the response, stepwise regression by MIC will likely include only one of them. For instance, 
suppose the true model is y = /i -I- 2/2 -I- e, where /i and /2 are nearly identical features. Not wanting to waste model-coding 
bits, regression MIC would probably give the model y = 3/i + e or y — 3/2 + e. 

With hypothesis testing, we aim to find all of the features that are significantly correlated with the response. In the 
above example, we would hope to include both /i and /2 in our set of relevant features. To do this, we regress Y on each 
feature in isolation; we keep those features that, according to MDL, deserve nonzero regression coefficients. Conceptually, we 
can think of this as a for- loop over features, in a given iteration of which we call an MIC-stepwise-regression function using 
the regular Y matrix but with an X matrix that contains only the current feature (and an intercept). There's one catch, 
though: Since our for-loop searches through to potential features to add, we're effectively doing to hypothesis tests, and we 
need to incorporate some sort of multiple-testing penalty. Below we describe a way to motivate such a penalty from an MDL 
perspective. 

Section 1.2 described a scenario to motivate ordinary MDL regression: Sender wanted to transmit data Y to Receiver 
when both Sender and Receiver knew the value of an associated matrix X of features. Now suppose instead that there are 
m different Receivers, where Receiver i knows the values only of the i^^ feature (i.e., the i"^ column of X). All the Receivers 
want to reconstruct Y, so Sender has to transmit messages to each of them telling them how to do it, possibly using the 
feature that they individually know. The messenger Hermes visits Sender and tells her some ground rules: Sender must 
transmit directly to each Receiver only information about how to reconstruct Y given a model. All model information itself 
(i.e., all the /3 coefficients) Sender has to transmit to Hermes, who will then visit each Receiver and tell him his appropriate 
coefficient (possibly if Sender hasn't bothered to encode a model for that feature). As a bonus. Sender is allowed to include 
in each model the intercept coefficient of Y for free, meaning that if Sender doesn't specify a feature regression coefficient for 
Receiver i, that Receiver at least gets the average Y to use in reconstructing Y. Letting /3q denote the model Sender tells 
Hermes, Sender's total description length is 

V0,)+V^{Y\p,), 



FDR := P{V + S >0) E 



V 
V + S 
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where 



m 

I)'"(y|^,) (26) 

1=1 

Here, we define AAi to be a null model, containing only the free intercept term, if Sender doesn't specify a nonzero regression 
coefficient for Receiver iin (3q. Otherwise, Mi contains both a costless intercept term and the regression coefficient for feature 
i. 

The result should be something like a series of hypothesis tests, in which the features in Pq correspond to the rejected 
null hypotheses. The idea is that, because Sender has to tell Hermes which coefficients go to which Receiver, the coding of 
(5q will involve a penalty that grows with the number of possible features, which is what we need to protect against alpha 
inflation from multiple tests. 

3.3 MDL Multiple-Testing Corrections 

We consider two coding schemes that Sender might use to tell Hermes which features are contained in her model [3q. 



3.3.1 Bonferroni- Style Penalty 

Suppose m = 5,000 and Sender wants to use nonzero coefficients for features 226, 1,117 and 3,486. One way to tell Hermes 
this is to transmit the binary representation of each of these numbers, using an idealized Igm bits for each one. Then, as in 
Section 1.2.2, Sender spends 2 bits to specify each coefficient. The resulting message to Hermes costs 31grn + 3 • 2 bits. In 
general. Sender's description length will be 

q\gm + 2q + V^{Y\Pq). (27) 

For each feature j, let — Ig Aj denote the decrease in residual coding cost that would result from using a nonzero coefficient 
for feature j. (See the Appendix for the motivation behind this notation.) Then minimizing (27) is equivalent to the following 
decision rule: Looking at the features in decreasing order of their — IgA^ values, add feature j to the model as long as 

-lgAj>lgTO + 2. (28) 

Interestingly, (39) in Appendix A. 4 shows that this is equivalent to a Bonferroni correction at some implied a. In fact, 
according to Appendix A. 5, taking a cost per coefficient around 2,^^ as we did for stepwise regression, corresponds to an a 
around 0.05. 



3.3.2 BH-Style Penalty 

If Sender expects to code more than one coefficient, rather than coding an entire index to specify each feature, she may 
find it advantageous to use a scheme similar to (21). There, the scheme was used to convey a subset of nonzero response 
coefficients in Partial MIC; here it would describe a subset of features from {1, . . . , m}. Using this code. Sender's description 
length using (3q would be 

Ig* q + c,„ + Ig {^^ +2q + V^{Y I Pq). (29) 

Apart from the two (small) terms at the beginning, this can be made to correspond at some implied a^^ with (44) and thus 
approximately with the ordinary BH procedure. 



4 MIC for Hypothesis Testing 

As was the case with regression, real-world hypothesis-testing problems often involve multiple responses. [TVW05, sec. 1] 
discuss the task of identifying a subset of 770 wavelength features that best predict 14 chemical quantities (e.g., pH, organic 
carbon, and total cations). Expression quantitative trait loci (eQTL) mapping is the process of looking for correlations 
between potentially thousands of gene-expression transcripts and genetic markers (e.g., [SMD+03]). Section 4.1 reviews some 
existing ways to deal with this problem, and then Section 4.2 describes the MIC approach. Experimental comparisons of the 
various methods are described in Section 4.3. 

^^2.77 to be exact, for m = 1 and one degree of freedom, though this changes somewhat with m and the number of degrees of freedom. 
■^"In fact, the same a as for the Bonferroni case above. 
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4.1 Existing Approaches 

We describe a few examples of ways to perform hypothesis testing with multiple responses. 

4.1.1 Bonferroni and BH 

Probably the most straightforward approach is to treat each response separately and apply, say, the Bonferroni or BH proce- 
dures of Section 3.1 one response at a time. We'll refer to these approaches as simply "Bonferroni" and "BH," respectively. 

According to [KCY+06, p. 20], a number of early eQTL studies took this approach, applying single-transcript mapping 
methods to each transcript separately. However, because such methods are designed to control false positives when analyzing 
a single transcript, multiple tests across transcripts may result in inflated FDR [KCY+06, p. 23]. 

4.1.2 BonferroniMatrix and BHMatrix 

To account for the greater number of hypothesis tests being performed with h > I responses, we might penalize not just by 
^ but The procedure that consists in applying the latter Bonferroni threshold to each p- value in our m x h matrix 
of p-values is what we'll call "BonferroniMatrix." Effectively, we're imagining our matrix of p-values as one long vector and 
applying the standard Bonferroni correction to that. 

Similarly, we can imagine turning our matrix of p-values into a single vector and applying the BH method to it: We'll call 
this approach "BHMatrix." In contrast to BH, it has a harsher starting penalty rather than ^), but it also allows us to 
pick out the best p-values from any location in the entire matrix, not just p-values in the particular column (response) that 
we're currently examining. [KCY+06, p. 21] describes a conceptually similar approach, called "Q-ALL," that used so-called 
q-values [Sto03] to identify significant marker-transcript correlations. 

The result of BonferroniMatrix at level a is the same as that of regular Bonferroni at level so if we're looking at a 
range of significance levels, it suffices to examine just Bonferroni or BonferroniMatrix; we'll show results for Bonferroni only 
in what follows. The same is not true for BH vs. BHMatrix, so we report on both methods. 

4.1.3 Empirical Bayes 

The BHMatrix approach does not treat each response separately, since, for example, if one response has lots of low p-values 
that pass the harshest thresholds of the step-up procedure, it can leave easier thresholds for the other responses. But this 
doesn't really "share information" across responses in a meaningful way. 

One approach that does is called EBarrays [KNLG03], which uses an empirical Bayes hierarchical model to estimate 
a prior for the underlying means of each transcript (response), allowing for stable inference across transcripts despite the 
small sample size of each individually. The goal is to determine, at each marker (feature), which transcripts show significant 
association. Originally designed for sharing transcript information one marker at a time, the method has been extended 
to a mixture over markers (MOM) model that uses the EM algorithm to assign probabilities that a transcript correlates 
to each marker [KCY+06]. As the names suggest, these methods apply specifically to eQTL problems, though of course, 
the empirical-Bayes shrinkage framework applies more generally. We give this as just an example of previous approaches to 
sharing strength across responses that have been developed for hypothesis testing. 

4.2 The MIC Approach 

While the EBarrays approach shares information across responses, it fails to take advantage of one potentially important 
source of shared strength: Namely, that features strongly associated with one response are perhaps more likely to be associated 
with other responses. In the case of eQTL, for example, we might suspect this type of sharing across responses by marker 
features corresponding to <rans-regulatory elements, which affect expression of many different genes. 

On the other hand, the MIC approach of Section 2 does pick up on sharing of features across responses, so we propose 
a method for hypothesis testing by MIC. Just as the single-response version of hypothesis testing by MDL in Section 3.2 
applied single-response regression to each feature individually, so we can also imagine applying the multiple-response MIC 
regression approach of Section 2.2 to each feature individually in order to do multiple-response hypothesis testing. 

This amounts basically to a for- loop of the MIC Stepwise Regression Algorithm 2.1 over each feature, except that the 
costs of coding the features are different from usual because of the setup that Hermes imposed (see Section 3.2). Just as 
in that section, we can consider a Bonferroni-style ( "Bonferroni-MIC" ) and a BH-style ("BH-MIC") coding scheme, using 
feature penalties of qlgm and Ig* g -I- Cm + Ig (™), respectively. Below we make each of these approaches more explicit. 



17 



4.2.1 Implementation 

With Bonferroni-MIC, we can evaluate each feature one at a time for inclusion based on whether StepwiseRegressionMIC 
(Algorithm 2.1) would have included nonzero coefficients for that feature, with the following exception. That algorithm 
includes a feature penalty of Ig of the number of features (columns) in the given X matrix. Here, when we give the algorithm 
only one column of X at a time, this term would normally be just Ig 1. However, the coding scheme that Hermes imposed 
requires us instead to use Igm bits to specify a feature, where m is the total number of features in X. Pseudocode appears 
in Algorithm 4.1. 

Algorithm 4.1: Bonferroni-MIC(X, y, method) 


for i = 1 to TO // Loop over the features. 
Essentially, we call 

Pi ^ StepwiseRegressionMIC (Xi, F, method), 
except that the usual penalty term corresponding to Ig of the number of columns of the 
input Xi matrix (just 1 here) is replaced by IgTO, where m is the number of columns of X. 
return (/3) 



BH-MIC is slightly trickier, because the cost of adding a new feature depends on how many features are already in the 
model. Our approach, outlined in Algorithm 4.2, is to loop through features and evaluate whether they would include nonzero 
coefficients when no feature-coding cost is imposed (i.e., no (IgTO)-like term is charged). This fills (3 with a possibly inflated 
number of nonzero coefficients that we subsequently trim down if necessary. To do this, we evaluate the cost of keeping 
the best q of the currently nonzero features for each q from to the current number of nonzero features Qorig, choose the 
optimal value q* , and zero out the rest. Note that zeroing out those rows of /3 is sufficient; we have no need to go back and 
recompute the optimal subset of responses for the remaining features, because the cost to specify the included features is the 
same regardless of the response subset. 



Algorithm 4.2: BH-MIC(A:, y, method) 


for i = 1 to TO // Loop over the features. 
Essentially, we call 

Pi <— Step wiseRegressionMIC (Xi, y, method), 
but without any cost to code features (the usual IgTO term). We also record Si, 
the (positive) number of bits saved by using the current subset of nonzero coefficients j3i 
instead of having all the coefficients zero. If Pi —= Oix/i, i.e., all the coefficients 
are already all zeros, Si <— 0. 
Sort the Si in decreasing order and call them S(i). S(i) corresponds to saving the most bits, etc. 
9orig ^ number of nonzero features currently in /? (possibly too many) 

( Q / / \ 

TO^ 



max 

9e{l,...9oriE} 



1 




(or (7* ^ if none of these is positive) 



Zero out the rows of /3 corresponding to features whose inclusion saves fewer than bits, 
return (/3) 



In both Bonferroni-MIC and BH-MIC, the algorithmic complexity is dominated by the for-loop. In Section 2.3, we saw 
that a call to StepwiseRegressionMIC required 0{mirh^) evaluations of description length, where we've used rrii to 
denote the number of columns of X^. In particular, TOj = 1 for each i and r is at most 1, so the complexity within a given 
for-loop iteration is only 0{h^). Overall, then, these algorithms evaluate description length 0{mh?) times. 



4.2.2 Hypothesis- Testing Interpretation 

Consider a single feature. In Section 3.2, we noted the correspondence between MDL regression on that feature and a 
statistical likelihood-ratio test of whether its regression coefficient was nonzero. With h > \ responses, MIC hypothesis 



18 



testing searches over subsets of responses to have nonzero coefficients with the feature. Each evaluation of description length 
during this process can be interpreted as a hypothesis test, in the way explained in the Appendix. In particular, when we 
evaluate the description length for some subset S C {1,2, of the responses being nonzero, we're implicitly doing a 

likelihood-ratio test with these null and alternative hypotheses: 

Hn : B = Oixh vs. , ^ 

' 30 
Hi : A 7^ for i e S, ^ ^ 

where f3 is the 1 x ft, matrix of true regression coefficients of the feature with the ft responses. In theory, MDL performs 
exponentially many such tests, one for each subset S,^^ though they are obviously highly correlated. We can interpret 
the Ig* fc + c/i + Ig C^) penalty to specify the subset of k of the responses (as described in Section 2.2.1) as something of a 
multiple-testing correction for doing all of these implicit hypothesis tests. 

Section 3.3 and the corresponding portions of the Appendix pointed out the approximate correspondence between partic- 
ular penalties and particular corrections: For instance, that Ig m in log-likelihood space is essentially a Bonferroni correction 
in p-value space, or that a cost per coefficient of 2.77 approximately corresponds to a = 0.05. With ft > 1 responses, these 
approximations are somewhat rougher. The reason is that in (30), k = \S\, the difference in dimension of the parameter 
spaces between Hq and Hi, is often greater than 1. In this case, — 21n A ~ xfk)' ^^'-^ ^"-"^ fc > 1, the standard-normal approxi- 
mation using 4> in Appendix A. 3 no longer goes through. Nevertheless, we can expect the Bonferroni- and BH-style penalties 
from Section 3.3 to do roughly the right things for multiple responses; we needn't make them match the Bonferroni and BH 
procedures exactly. (If we did, what would be the point of using information theory?) 



4.3 Experiments 

When we measure the performance of our algorithms in terms of test-set error, we can use cross-validation to run experiments 
on real data sets, as we did in Section 2.4. When, instead, we need to judge how well our algorithm identifies truly nonzero 
coefficients, we have to fall back to synthetic data. That's because the real world doesn't tell us the true /3 matrix^^ — we 
only know it if we make it ourselves. Thus, in this section, we rely entirely on synthetic data. However, we can sometimes 
base our synthetic data on a real-world data set, and we do this with the Yeast Growth data in Section 4.3.2. 

For each of the scenarios below, we generated 25 random instances of the data sets, corresponding to 25 random (3sim 
matrices, taking n = 100 training data points and ft = 20 responses. Because synthetic data is cheap, we evaluate test-set 
error on 10,000 test data points. We calculate error for each response as SSE/-y/n, where SSE is the sum of squares error 
from the ordinary-least-squares regression of the given response on exactly the features selected by the algorithm (plus an 
intercept). We report precision and recall at the coefficient level (whether each particular nonzero coefficient was selected). 
We calculate these results separately for each response in each data set, so that standard errors represent standard deviations 
divided by V20 • 25. 

We compare the following feature-selection approaches: 

• "Truth" : Use an oracle to select exactly the true features. 

• "Bonf,a=ao": For each response separately, select those features whose p- values with the response are < ao/m. We 
calculate the p-value for a given feature and response by regressing the response against only that feature and an 
intercept, and evaluating the p-value of the slope regression coefficient. 

• "Indep,cpc=Co" : Apply the RIG algorithm described in Section 2.2 to each response separately, using cq as the cost to 
code a coefficient value, (cq does not include the cost to specify which features enter the model, which is always Ig m. 
It shouldn't fall below 0, and the lowest value we tried was 0.1.) 

• "Bonf-MIG" : Bonferroni-style MIG, as described in Algorithm 4.1. The cost to code a coefficient is 2 bits. 

• "BH,a=Q!o": For each response separately, calculate p-values the same way as with "Bonf,a=:aoi" but apply the BH 
procedure (25) at level ao instead of the Bonferroni correction. 

• "BHMat,Q;=Q;o" : Apply the BH procedure at level ao to the entire m x h matrix of p-values all at once. 

• "BH-MIG": BH-style MIG, as described in Algorithm 4.2. The cost to code a coefficient is 2 bits. 

^^Of course, in practice we do only 0{h?) of them during our stepwise-style search through responses to include. 

■^■^Indeed, there may not be any true /3 matrix, because the data are not really generated according to our model (18), though the approximation 
works well enough. 
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The MIC methods have no free parameters (if we fix the cost per coefficient at 2 bits) , so they give point values for precision 
and recall. To allow for comparison, then, we ran the other methods at a variety of ao or cq levels, presenting results for 
ao or Co levels at which precision approximately matched that of MIC; relative performance can then be assessed based on 
recall. Because we tried only a discrete grid of ao or cq values, precisions did not always match exactly, so we took the highest 
precision not exceeding the MIC value. This puts MIC at a slight disadvantage in comparing recall, because its precision is 
often slightly higher than that of the other methods. 

In our tables below, we make bold those results that represent the best performance among the three Bonferroni-style 
methods and also among the three BH-style methods. Here, "best performance" is based on the observed average value, but 
the differences are often not statistically significant. 

4.3.1 Simulated Synthetic Data 

We created three data sets in the same manner as described in Section 2.4.1, the only difference being that we took m — 1,000 
features instead of 2,000 (for no particular reason). Table 5 shows the results for the Partial, Full, and Independent scenarios. 

MIC appears to have had worse test-error than the other methods except in the Full scenario, where it performed close to 
the Truth. As the especially poor Independent performance suggests, this is probably due to the fact that MIC has a harder 
time picking up on single nonzero coefficients in a given row of /3sim, due to its "higher overhead" cost to put them in (see 
Table 2 for fc = 1). 

Of course, for hypothesis-testing algorithms, test error is not the most important metric. On recall, MIC does outperform 
its competitors in the Partial and especially Full scenarios. And as we would expect, it performs worse in the Independent 
case, again due to its high overhead for individvial coefficients on a row of f3. 



Partial Scenario 



Method 


Truth 


Bonf,a=l Indep,cpc=0.1 Bonf-MIC 


BH,a=0.3 BHMat,a=0.3 BH-MIC 


Train Err 
Test Err 
Coeff Prec 
Coeff Rec 


0.31 ±0.00 
0.32 ±0.00 
1.00 ±0.00 
1.00 ±0.00 


0.52 ±0.01 0.63 ±0.01 0.56 ± 0.01 
0.57 ±0.01 0.66 ±0.01 0.59 ± 0.02 
0.74 ±0.01 0.96 ±0.00 0.76 ± 0.01 
0.60 ±0.01 0.53 ±0.01 0.71 ±0.01 


0.51 ±0.01 0.52 ±0.01 0.53 ±0.01 
0.56 ±0.01 0.57 ±0.01 0.57 ±0.01 
0.70 ±0.01 0.73 ±0.01 0.74 ±0.01 
0.61 ±0.01 0.60 ±0.01 0.73 ±0.01 


Full Scenario 


Method 


Truth 


Bonf,Q=2 Indep,cpc=0.1 Bonf-MIC 


BH,a=0.45 BHMat,a=0.5 BH-MIC 


Train Err 
Test Err 
Coeff Prec 
Coeff Rec 


0.31 ±0.00 
0.32 ±0.00 
1.00 ±0.00 
1.00 ±0.00 


0.49 ±0.01 0.61 ±0.01 0.30 ±0.00 
0.54 ±0.01 0.63 ±0.01 0.33 ± 0.00 
0.62 ±0.01 0.97 ± 0.00 0.62 ± 0.01 
0.61 ±0.01 0.53 ±0.01 0.99 ±0.00 


0.49 ±0.01 0.48 ±0.01 0.30 ± 0.00 
0.54 ±0.01 0.53 ±0.01 0.33 ± 0.00 
0.59 ±0.01 0.57 ±0.01 0.61 ± 0.01 
0.61 ±0.01 0.61 ±0.01 0.99 ±0.00 


Independent Scenario 


Method 


Truth 


Bonf,a=0.1 Indep,cpc=0.1 Bonf-MIC 


BH,a=0.04 BHMat,a=0.05 BH-MIC 


Train Err 
Test Err 
Coeff Prec 
Coeff Rec 


0.31 ±0.00 
0.32 ±0.00 
1.00 ±0.00 
1.00 ±0.00 


0.59 ±0.01 0.59 ±0.01 0.82 ±0.02 
0.62 ±0.01 0.62 ±0.01 0.86 ± 0.02 
0.96 ±0.01 0.96 ±0.01 0.96 ± 0.01 
0.53 ±0.01 0.54 ±0.01 0.40 ± 0.01 


0.59 ±0.01 0.59 ±0.01 0.70 ± 0.01 
0.62 ±0.01 0.62 ±0.01 0.73 ± 0.02 
0.96 ±0.01 0.96 ±0.01 0.96 ± 0.01 
0.54 ±0.01 0.54 ±0.01 0.47 ±0.01 



Table 5: Test-set accuracy, precision, and recall of MIC and other methods on 25 instances of the synthetic data sets generated 
as described in Section 2.4.1, except with m — 1,000. 



4.3.2 Simulated Yeast Data 

We created a synthetic Yeast Growth data set based on the one described in Section 2.4.2. As will be explained below, 
talking about the "true coefficients" of a synthetic model requires that the features be relatively uncorrelated. However, 
the transcript features of the original data set were highly correlated, so we included only the 526 marker features in the X 
matrix. As before, we made use of the 20 growth clusters as the Y matrix. We created four types of data sets with varying 
degrees of correlation among the features, but the basic data-generation process was the same for each, and we describe it 
below. 



20 



We started by trying to approximate what the true distribution of nonzero coefficients in (3 might look like by running 
the RIC algorithm described in Section 2.2 with a cost of 3 bits per coefficient. The resulting 526 x 20 /3 matrix contained 33 
nonzero coefficients, many of them in blocks of four or five down certain columns due to correlation among adjacent marker 
features. 

The next step was to construct a true j3 matrix /3sim resembling, but not identical to, (3?'^ We did this based on summary 
statistics about (i: What fraction / of the features had any nonzero coefficients? (/ = 0.05 for the actual j3.) Of those that 
did, what was the average number a of nonzero coefficients in each row? (a = 1.3 here.) And what were the sample mean /x^ 
and standard deviation cr^ of the nonzero coefficient values? (Actual values were —0.12 and 0.20, respectively.) We initialized 
an empty 526 x 20 matrix /3sim and, to fill it in, walked down the rows, each time flipping a coin with probability / to decide 
whether to give that row nonzero coefficients. We drew the number of nonzero coefficients from a Poisson distribution with 
rate a (capped at 20, the total number of responses) and distributed those coefficients randomly among the 20 columns. 
The values of each of those coefficients were drawn independently from a normal distribution with mean fip and standard 
deviation ct^. When we had finished walking down the rows, we checked to make sure that the total number of nonzero 

coefficients in Psim was within 25% of that in the original /?; if not, we started the process over. 

We then constructed a simulated 100 x 526 X matrix Xgim by drawing each row from a multivariate-normal distribution 
with some covariance S'^ based on the covariance matrix of the real X matrix. The first three variants of our synthetic data 
set consisted of taking S'^ to be one of S;^ (diagonal), with A = 0.5 (half diagonal, half full), and E;^ (full), with the 
subscripts as in Section 2.2.2. Figure 2 plots, respectively, S;^, Sjfs, and S;^ as correlation matrices (i.e., each entry is scaled 
to fall in the range [0, 1]), with red indicating "high correlation" and blue, "low correlation." 

All three of the above variants involved normally distributed feature values, even though X itself consisted only of O's 
and I's (minor and major allele, respectively). Thus, as the fourth variant, we took Xsira to be the original binary-valued X 
matrix. 




Figure 2: Correlation matrices of the 526 marker features in the Yeast Growth data set. The figures correspond to, from left 
to right, diagonal shrinkage, half-diagonal shrinkage, and no shrinkage. Red — correlation near 1, yellow — correlation near 
0.8, light blue = correlation near 0.2, dark blue = correlation near 0. Due to the relatively high correlation among features, 
measures of precision and recall are slightly misleading, because which features are "truly correlated" with responses becomes 
unclear, as the text explains. 

Finally, we estimated Sjj in the manner of Section 2.2.2 with only an intercept feature in the model. With these matrices 
in place, we computed 

^iin — ^sim/^sim ^~ ^simi 

with each row of esi,„ being independently distributed as A/'/i(0, S^)). 

Table 6 shows the results for each variant of the data set. In general, MIC performs essentially indistinguishably from the 
other methods. This is probably because a, the average number of responses per feature, is only 1.3 for this particular data 
set, implying that there isn't substantial sharing of features across responses. There does appear to be enough sharing that 
MIC doesn't perform as much worse than the other methods as in the case of the Independent synthetic data set in Table 5. 

The precision of each of the methods degrades significantly as the correlation among the features increases. In fact, these 
numbers are somewhat misleading, because the notion of a "true coefficient" becomes fuzzy when features are correlated. 

■^■^If we took /3sim to be equal to (5, then the RIC algorithm might have artificially high precision and recall. That's because we're defining the 
"true features" for the synthetic data set as those that the RIC algorithm returned on the real data set. Since we designed the synthetic data set 
to imitate the real one, we might expect RIC to return similar patterns of coefficients in both cases. Generating a new matrix to serve as fisii-n also 
allows for randomization over multiple instances of this synthetic data set. Of course, one downside of this simulation approach is that correlational 
structure from the original problem is lost. 
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We generated the data by imposing a particular Psim matrix, with nonzero coefficients in certain locations. For instance, for 
response 3, we might have given feature 186 a coefficient of 0.2, causing a correlation between that feature and that response. 
However, because feature 186 is correlated with, say, features 185 and 187, features 185 and 187 will also be correlated 
with response 3, so we end up selecting them as well. This isn't necessarily wrong to do, because the data really do show 
a correlation; the arbitrariness of our choice of zeros and nonzeros in Psim is at fault for the large apparent rate of "false 
positives." 



Each row of X distributed A/'m(0,S;^) — no correlation among features. 



Method 


Truth 


Bonf,a=0.1 


Indep,cpc=0.1 


Bonf-MIC 


BH,a=0.06 


BHMat,a=0.085 


BH-MIC 


Train Err 
Test Err 
Coeff Prec 
Coeff Rec 


0.02 ±0.00 
0.02 ±0.00 
1.00 ±0.00 
1.00 ±0.00 


0.02 ±0.00 
0.03 ±0.00 
0.92 ±0.01 
0.80 ±0.01 


0.02 ±0.00 
0.02 ±0.00 

0.90 ±0.01 
0.81 ±0.01 


0.02 ±0.00 
0.03 ±0.00 
0.92 ±0.01 
0.79 ±0.01 


0.02 ±0.00 
0.02 ± 0.00 
0.91 ±0.01 
0.81 ± 0.01 


0.02 ±0.00 
0.02 ±0.00 
0.91 ± 0.01 
0.81 ±0.01 


0.02 ±0.00 
0.02 ±0.00 

0.92 ± 0.01 
0.80 ±0.01 




Each row of X distributed A/'m(0, S 


^g) — some correlation among 


features. 




Method 


Truth 


Bonf,a=0.055 


Indep,cpc=l 


Bonf-MIC 


BH,a=0.03 


BHMat,a==0.045 


BH-MIC 


Train Err 
Test Err 
Coeff Prec 
Coeff Rec 


0.02 ±0.00 
0.02 ±0.00 
1.00 ±0.00 
1.00 ±0.00 


0.03 ±0.00 
0.03 ±0.00 

0.78 ±0.01 
0.79 ±0.01 


0.03 ±0.00 
0.03 ±0.00 
0.76 ±0.01 
0.79 ±0.01 


0.02 ±0.00 
0.03 ±0.00 
0.78 ±0.01 
0.80 ±0.01 


0.02 ±0.00 
0.03 ±0.00 
0.75 ±0.01 
0.79 ±0.01 


0.02 ± 0.00 
0.03 ±0.00 
0.75 ±0.01 
0.79 ±0.01 


0.02 ±0.00 
0.02 ± 0.00 
0.75 ±0.01 
0.80 ±0.01 




Each row of X distributed AfmiO, S 


p) — lots of correlation among 


features. 




Method 


Truth 


Bonf,a=0.19 


Indep,cpc=0.1 


Bonf-MIC 


BH,a=0.07 


BHMat,Q=0.06 


BH-MIC 


Train Err 
Test Err 
Coeff Prec 
Coeff Rec 


0.02 ±0.00 
0.02 ±0.00 
1.00 ±0.00 
1.00 ±0.00 


0.02 ±0.00 
0.02 ±0.00 
0.19 ±0.01 
0.84 ± 0.01 


0.02 ±0.00 
0.02 ±0.00 
0.20 ±0.01 
0.84 ±0.01 


0.02 ±0.00 
0.02 ±0.00 
0.19 ±0.01 
0.84 ± 0.01 


0.02 ±0.00 
0.02 ±0.00 
0.15 ±0.01 
0.86 ±0.01 


0.02 ±0.00 
0.02 ±0.00 

0.15 ±0.00 
0.85 ±0.01 


0.02 ±0.00 
0.02 ±0.00 

0.15 ±0.01 
0.85 ±0.01 


Real, original X matrix. 


Method 


Truth 


Bonf,a=0.17 


Indep,cpc=0.1 


Bonf-MIC 


BH,a=0.06 


BHMat,a=0.055 


BH-MIC 


Train Err 
Test Err 
Coeff Prec 
Coeff Rec 


0.02 ±0.00 
0.02 ±0.00 
1.00 ±0.00 
1.00 ±0.00 


0.02 ±0.00 
0.02 ± 0.00 

0.15 ±0.00 
0.82 ±0.01 


0.02 ±0.00 
0.02 ±0.00 
0.15 ±0.00 
0.81 ±0.01 


0.02 ±0.00 
0.02 ± 0.00 
0.15 ±0.00 
0.83 ±0.01 


0.02 ±0.00 
0.02 ± 0.00 
0.13 ±0.00 
0.84 ±0.01 


0.02 ±0.00 
0.02 ±0.00 
0.13 ±0.00 
0.84 ±0.01 


0.02 ±0.00 
0.02 ± 0.00 
0.13 ±0.00 
0.84 ± 0.01 



Table 6: Test-set accuracy, precision, and recall of MIC and other methods on 25 instances of the synthetic data sets 
generated as described in Section 4.3.2, with m = 526 features. 



5 Conclusion 

The MDL principle provides a natural framework in which to design penalized-regression criteria for feature selection. In the 
case of a single response, one example of this is RIC, which can be characterized by an information-theoretic penalty of 2 Ig to 
bits per feature when selecting from among m total features. We proposed an extension of this criterion, called MIC, for the 
case of multiple responses. By efficiently coding the locations of feature-response pairs for features associated with multiple 
responses, MIC allows for sharing of information across responses during feature selection. The method is competitive with, 
and sometimes outperforms, existing multitask learning algorithms in terms of prediction accuracy, while achieving generally 
sparser, more interpretable models. 

MDL can also be viewed in the domain of hypothesis testing as a way of correcting against alpha inflation in multiple tests. 
We explained how MDL regression applied to each feature separately can be interpreted along the lines of standard Bonferroni 
and Benjamini-Hochberg feature-selection procedures. Again using the MIC approach, we extended this to hypothesis testing 
with multiple responses, allowing for greater power in selecting true coefficients when features are significantly shared across 
responses. 



22 



A MDL and Hypothesis Testing 



Information theory describes an isomorphism between probabiHties and code lengths in which the ideahzed code length of 
some symbol x is — IgP(x) [Gru05, p. 28]. There is a similar, though only approximate, relationship between MDL and 
the process of statistical hypothesis testing: namely, that MDL will tend to introduce an extra parameter into a model in 
roughly the same cases that a hypothesis test would reject the null hypothesis that the parameter is zero. This is perhaps 
unsurprising, because MDL is a tool for model selection, and hypothesis testing is about rejecting models that fit the data 
poorly. 



A.l Generalized Likelihood Ratio Tests 

Given two models Aio and Mi and data T>, statisticians define A to be the ratio of their likelihoods: 

A - ^(^1-^°) (31) 

In many cases, these models correspond to the same probability density function with different parameter settings. For 
instance, in regression of a single response y on a single feature x (with no intercept for simplicity): 

y = (3x + e, er^Af{0,a'^), (32) 

A4o might specify that, given x, y is distributed Af{0,a^), i.e., that /? = 0, while Aii might say that given x, y is distributed 
A/'(/3a;, CT^), where (3 is the maximum-likelihood estimate for (3 derived from the training data. Note that Mi here depends 
on the observed data V. Some authors prefer to speak of fixed probability distributions Me with parameters 9 ranging over 
different parameter spaces Oo and Qi [LM05, p. 463], in which case 



sup PiVlMe) 

^ ^ 9geo 

sup P{V\Me)' 

eeei 



In the regression example, Qq = {0} while Qi — M, the entire real line. 

In fact, it will be convenient to consider — IgA, which is large when A is small. In some sense, — IgA measures the 
"badness of fit" of Mq relative to Mi- Over different data sets V, —IgA will take on different values, according to some 
probability distribution. If Mq is true, — In A is unlikely to be very large, so we can define a threshold Ta such that 

P{-lgA>Ta,\Mo)^a (33) 

and reject Mq whenever — In A exceeds this threshold. This is known as a generalized likelihood ratio test (GLRT), of which 
many standard statistical hypothesis tests are examples, including the t-test on the significance of a regression coefficient 
[LM05, p. 685]. 



A.2 Model Selection 

Consider the approach that MDL would take. It "rejects" Mq in favor of Ali just in the case that the description length 
associated with A^i is shorter: 



e{Mi) + £{V I Ml) < £{Mo) + e{V I Mo), (34) 
where £{■) stands for description length. By the discussion in section 1.2.1, £{T> \ Mi) = — Ig P{T> \ Mi), so that (34) becomes 

£{Mi) - £{Mo) < lgP{V \Mi) - Ig P{V I Mo), 



or 



^lgA> £{Mi) -£{Mo). (35) 

If A^i is more complicated than Mo, then £{Mi) — £{Mo) will be positive, and there will be some a at which £{Mi)—£{Mo) — 
Ta as defined in (33). Thus, selecting between Mo and Mi is equivalent to doing a likelihood ratio test at the imphcit 
significance level a determined by £{Mi) — ^(A^o)- 
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A. 3 Example: Single Regression Coefficient 

While MDL introduces the quantity — IgA, it is more common in statistics to deal instead with —2 In A — — (21n2)lgA 
because of the following result: If AIq and Mi both belong to the same type of probability distribution satisfying certain 
smoothness conditions, and if Mq has parameter space &q with dimensionality dim(Oo) while Mi has parameter space Qi 
with dimensionality dim(Oi) > dim(0o), then under Mi, —(2 In 2) IgA is asymptotically chi-square with degrees of freedom 
equal to dim(0i) — dim(0o) [Ric95, sec. 9.5]. In particular, if the probability distribution is normal (as in our regression 
model), then the chi-square distribution will be not just asymptotic but exact. 

In (32), where we have a single regression coefficient, dim(0i) — dim(Oo) = 1, so that under Mi, —(2 In 2) IgA has a 
chi-square distribution with 1 degree of freedom, which is the same as the distribution of the square of a standard normal 
random variable Z ^ A/'(0, 1). Thus, we can rewrite (33) as 



P(-(21n2)lgA > (21n2)r„ I A^o) = a ^ P(Z^ > (2 In 2)r„) = a 

^ P{Z < -v/(21n2)T„) = 

a 



a 



2 (36) 



<i>(-x/(21n2)r„)= ^, 



1 ^ 

where <I> is the cumulative distribution function of the standard-normal distribution. Using the approximation <I>(— x) ~ 
for relatively large x > 0,^^ (36) becomes 

1 / (21n2)T^\ a ln{2a) 

A. 4 Bonferroni Criterion 

The Bonferroni criterion in p- value space says "reject Hq when the p- value is less than By (33), we can express this in 
log- likelihood space by saying "reject Hq when — IgA > Ti^," where, by (38), 

=lgm-lga- 1. (39) 

Note the similarity to the decision criterion (28). 
A. 5 Closeness of the Approximation 

It's worth reflecting on why these correspondences are only approximate. Indeed, if code length is just negative log probability, 
and a p-value p is a, probability, then, say, the Bonferroni rule of rejecting when p < — is equivalent to rejecting when 
— \gp > Igm — Iga, which looks basically the same as (39). However, A, which is used by MDL, is not exactly equal to p; 
the former is a comparison of a probability density function at two points, while the latter is an area under the probability 
density function in extreme regions. Still, the two are often quite close in practice. 
How close? Suppose we observe some value A* for A. The associated p-value is 

p = P{A< A, I Mo) = P(-21nA > -21nA, |7Wo). 

^*This follows from approximating 



K-) = / 7= 



e 2 dt 

2lT 



by the value of the integrand at t = —x, assuming — i ^ 



4 ■ 



Another way to see the approximation is as follows. [P6145, pp. 63-67] proved that for a; > 0, 

0.5 - <S>{-x) ^ i ^1-cxp (37) 

If X is large enough that a:)-^ is negligible compared with <!>(— x), then we can multiply by 2 and square both sides of (37) to give 

/ 2x^ \ 1 / 2x^ 
1 — 4<I>(— x) Ri 1 — cxp ( ) <!>(— x) ft! - exp { 

\ TT J 4 V TT 

Assuming tt ss 4 gives the result. 

Note that x) ft — -i=e~T tends to give a tighter approximation, especially for x bigger than ~ 2, but it's less mathematically convenient 
here. 

I credit [Sti, p. 17] with inspiring the approach of using an approximation to <I> to compare hypothesis testing and coding costs. 
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If -21nA xfk) under Ma, 

P=l-i^x^,(-21nA,), (40) 

where F stands for the cumulative distribution function. Figure 3 compares A* to the associated p- value for the case of fc = 1 
degree of freedom; a straight, 45-degree line would be a perfect match, but the approximation is generally correct to within 
a factor of 2 or 3. 

We can use this curve to compute the a implied by a more comphcated model Mi in (35). Letting Ac :— i{Mi) — i{Mo), 
(35) and (40) give 

a^l~F^2 ((2 In 2) Ac). 
Table 7 shows examples for = 1 degree of freedom, a — 0.05 is achieved at Ac = 2.77 bits.^^ 



Ac (bits) 12 3 4 
Implied a 0.24 0.1 0.04 0.02 



Table 7: Example implied a values against the increase in coding cost Ac of the more complicated model, for fc = 1 degree 
of freedom. 




Likelihood ratio (Lambda*) 



Figure 3: The actual p- value P(A < A, | Alo) vs. A, for fc = 1 degree of freedom. 
A.6 BH-Style Penalty 

Suppose we regress y on each of m features separately. We would obtain m p-values, to which we could apply the BH step-up 
procedure (25). Alternatively, we could evaluate, for each feature j, a negative log-likehhood ratio: 

-lgA, = -lgP(y|Mo)-(-lgP(>^|My)), 

where M\j denotes the alternative hypothesis in which feature j has a nonzero, maximum- likelihood coefficient. 
In the same way that we obtained (39), we can derive a BH-type rejection level of 

Ti2_ = Ig TO - Ig j - Ig a - 1. 

■^^[BL05, p. 582] present a very similar discussion, only with log base e and with their A equal to ^ . The numerical values given are equivalent. 
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We can now restate the BH procedure as follows: Put the — IgAj in decreasing order, and let —IgA^) be the j biggest. 
Reject . . . , such that 

q = max{j : -lgA(j) > Ig m - Ig j - Ig a - 1} . (41) 
We could rephrase (41) as choosing the q that maximizes 

J2 ^U) - Ig TO + Ig J + Ig a + 1) 

or that minimizes 

9 q 
- ^ (- Ig A(,) - Ig m + Ig J + Ig « + 1) = Ig ^ - g(lg a + 1) + ^ (Ig P(y I A^o) - Ig P{Y | XiO"))) • (42) 

Since lgP(l' | A^o) is a constant with respect to q, we can subtract it m times from (42) to give 

lg"V-g(lg«+l)+E-lg^(^l-^iO))+ E -lg^(>'l>'o)=lg"V-'7(lg«+l) + 2'"'(>^l^,) (43) 

J = l 5=9+1 ^■ 

with 2?"'(y I (3q) as in (26), in which Pq is the model containing the q features with highest negative log- likelihood ratio values. 
Now, if m ^ (7, then m"^ « {rn^q)'. ' ^^'^ (^"^^ becomes 

777 1 ^ / 777 \ ^ 

Ig „ • -g(lga+l) + I?'"(y|/3,)^lg )-q{lga+l)+V"^{Y\f3q), (44) 
g!(m-g)! \qj 

in very close analogy to (29).^^ 
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